Online Chinese-Vietnamese Bilingual Topic Detection Based on RCRP Algorithm with Event Elements
Authors
Abstract
To address the characteristics of online Chinese-Vietnamese topic detection, we propose a Chinese-Vietnamese bilingual topic model based on the Recurrent Chinese Restaurant Process (RCRP) and integrated with event elements. First, the event elements (people, places, and times) are extracted from newly arriving bilingual news texts. Next, word pairs from the bilingual news and comments are tagged and aligned. Both the event elements and the aligned words are then integrated into the RCRP algorithm to construct the proposed bilingual topic detection model. Finally, the model decides whether each new document forms a new category or belongs to an existing one, and a topic is detected accordingly. In comparative experiments, the proposed model performs well on topic detection.
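The final step described above (deciding between a new category and an existing one) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the simplest one-epoch form of the RCRP prior, where an existing topic is weighted by its document count in the previous epoch plus its count so far in the current epoch, and a new topic is weighted by a concentration parameter `alpha`. The fit score (an equal-weight Jaccard overlap of event elements and aligned words), the `new_topic_fit` constant, the dictionary layout, and all function names are hypothetical choices made only for illustration. Words are assumed to be already mapped into one shared vocabulary through the aligned word pairs.

```python
NEW = "<new>"  # hypothetical label for "open a new topic"


def rcrp_prior(prev_counts, curr_counts, alpha):
    """One-epoch RCRP prior: existing topic k gets weight
    n_{k,t-1} + n_{k,t}; a new topic gets weight alpha."""
    topics = set(prev_counts) | set(curr_counts)
    weights = {k: prev_counts.get(k, 0) + curr_counts.get(k, 0) for k in topics}
    weights[NEW] = alpha
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_topic(doc, topics, prev_counts, curr_counts,
                 alpha=1.0, new_topic_fit=0.1):
    """Pick the label with the highest prior * fit score.

    The fit score is an assumed stand-in for a likelihood: the mean of the
    overlap on event elements and the overlap on aligned words."""
    prior = rcrp_prior(prev_counts, curr_counts, alpha)
    scores = {}
    for label, p in prior.items():
        if label == NEW:
            fit = new_topic_fit
        else:
            topic = topics[label]
            fit = 0.5 * jaccard(doc["elements"], topic["elements"]) \
                + 0.5 * jaccard(doc["words"], topic["words"])
        scores[label] = p * fit
    return max(scores, key=scores.get)


# Toy data, invented for the example.
topics = {"quake": {"elements": {"Hanoi", "2015"},
                    "words": {"earthquake", "rescue"}}}
prev_counts = {"quake": 3}   # documents per topic in the previous epoch
curr_counts = {}             # documents per topic so far in this epoch

related = {"elements": {"Hanoi", "2015"}, "words": {"earthquake"}}
unrelated = {"elements": {"Beijing"}, "words": {"football"}}

print(assign_topic(related, topics, prev_counts, curr_counts))
print(assign_topic(unrelated, topics, prev_counts, curr_counts))
```

With this toy data the first document shares its event elements with the existing topic and is assigned to it, while the second shares nothing and is labeled as a new topic. A full model would sample from a posterior instead of taking the maximum and would use a proper likelihood over words and event elements.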
Similar resources
Efficient Online Inference for Infinite Evolutionary Cluster models with Applications to Latent Social Event Discovery
The Recurrent Chinese Restaurant Process (RCRP) is a powerful statistical method for modeling evolving clusters in large scale social media data. With the RCRP, one can allow both the number of clusters and the cluster parameters in a model to change over time. However, application of the RCRP has largely been limited due to the non-conjugacy between the cluster evolutionary priors and the Mult...
Multilingual Topic Detection Using a Parallel Corpus
We have developed an approach for topic detection from multilingual news, in particular Chinese and English. We extract named entities such as people names, geographical location names, and organization names automatically from the news content by transformation-based linguistic taggers. These sets of named entities together with the remaining content terms form the basis of news representation...
Discovery of Unknown Events From Multi-lingual News
We have proposed a new approach to detect topically-related events from multi-lingual news sources. In particular, we are interested in Chinese and English on-line newswire stories. Three categories of named entities terms, namely, people names, geographical location names, and organization names, together with the story content terms constitute the basis for story representation. The named ent...
Recurrent Chinese Restaurant Process with a Duration-based Discount for Event Identification from Twitter
Due to the fast development of social media on the Web, Twitter has become one of the major platforms for people to express themselves. Because of the wide adoption of Twitter, events like breaking news and release of popular videos can easily catch people’s attention and spread rapidly on Twitter, and the number of relevant tweets approximately reflects the impact of an event. Event identifica...
Extending an on-line parallel corpus management system to handle specific types of structured documents
Parallel bilingual or multilingual corpora are often handled as collections of segments without any specific document organization. We describe SECTra_w, a web-oriented system which has been used for online MT evaluations, and has recently been extended to handle multimodal documents such as French-Chinese/Vietnamese/Hindi/Tamil interpreted bilingual spontaneous dialogues, mainly spoken but als...